AITopics | human video data

human video data

Pre-training Auto-regressive Robotic Models with 4D Representations

Niu, Dantong, Sharma, Yuvan, Xue, Haoru, Biamby, Giscard, Zhang, Junyi, Ji, Ziteng, Darrell, Trevor, Herzig, Roei

arXiv.org Artificial Intelligence, Feb-18-2025

Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pretrained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared ...

This could potentially be attributed to the scarcity of large-scale, diverse robotic data, unlike the abundance of text and image data available for vision and language FMs. The lack of robotic data poses a significant bottleneck in training foundation models that can effectively generalize across diverse robotic platforms and tasks. To overcome this limitation, several recent approaches (Xiao et al., 2022; Ye et al., 2024) employ representation learning by pre-training on an abundance of human data, enabling transfer to robotic systems. These approaches aim to recognize the inherent similarities between human and robot manipulation tasks and exploit the vast repositories of human video data available on the internet. Yet, these approaches have not been able to demonstrate effective generalization to downstream tasks. In part, this is due to their representations lacking an understanding of the physical world (Zhen et al., 2024a), and therefore being less effective for robotics.
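The idea of lifting 2D point tracks into 3D using per-frame monocular depth can be illustrated with a minimal sketch. This is not the ARM4R pipeline: it assumes a standard pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a depth map per frame given as a nested list, and a nearest-pixel depth lookup. The names `backproject` and `lift_tracks` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]          # (u, v) pixel coordinates
Point3D = Tuple[float, float, float]   # (x, y, z) in camera coordinates
DepthMap = Sequence[Sequence[float]]   # depth_map[row][col], assumed metric depth


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Pinhole back-projection of one pixel with known depth (assumed camera model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_tracks(tracks_2d: Sequence[Sequence[Point2D]],
                depth_maps: Sequence[DepthMap],
                fx: float, fy: float, cx: float, cy: float) -> List[List[Point3D]]:
    """Lift 2D tracks to 3D tracks over time.

    tracks_2d[t][n] is the pixel position of tracked point n in frame t;
    depth_maps[t] is the estimated depth map for frame t.
    """
    tracks_3d: List[List[Point3D]] = []
    for frame_points, depth_map in zip(tracks_2d, depth_maps):
        height, width = len(depth_map), len(depth_map[0])
        lifted: List[Point3D] = []
        for u, v in frame_points:
            # Nearest-pixel lookup, clamped to the image bounds (a simplification).
            col = min(max(int(round(u)), 0), width - 1)
            row = min(max(int(round(v)), 0), height - 1)
            lifted.append(backproject(u, v, depth_map[row][col], fx, fy, cx, cy))
        tracks_3d.append(lifted)
    return tracks_3d
```

Stacking the per-frame 3D points over time gives one 3D trajectory per tracked point, which is one way to read the "4D" (3D plus time) representation described in the abstract.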

large language model, machine learning, natural language, (15 more...)

arXiv:2502.13142

Genre: Research Report (0.83)

Industry: Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Action-Free Reasoning for Policy Generalization

Clark, Jaden, Mirchandani, Suvir, Sadigh, Dorsa, Belkhale, Suneel

arXiv.org Artificial Intelligence, Feb-10-2025

End-to-end imitation learning offers a promising approach for training robot policies. However, generalizing to new settings remains a significant challenge. Although large-scale robot demonstration datasets have shown potential for inducing generalization, they are resource-intensive to scale. In contrast, human video data is abundant and diverse, presenting an attractive alternative. Yet, these human-video datasets lack action labels, complicating their use in imitation learning. Existing methods attempt to extract grounded action representations (e.g., hand poses), but resulting policies struggle to bridge the embodiment gap between human and robot actions. We propose an alternative approach: leveraging language-based reasoning from human videos, which is essential for guiding robot actions, to train generalizable robot policies. Building on recent advances in reasoning-based policy architectures, we introduce Reasoning through Action-free Data (RAD). RAD learns from both robot demonstration data (with reasoning and action labels) and action-free human video data (with only reasoning labels). The robot data teaches the model to map reasoning to low-level actions, while the action-free data enhances reasoning capabilities. Additionally, we will release a new dataset of 3,377 human-hand demonstrations with reasoning annotations compatible with the Bridge V2 benchmark and aimed at facilitating future research on reasoning-driven robot learning. Our experiments show that RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data. Furthermore, scaling up action-free reasoning data significantly improves policy performance and generalization to novel tasks. These results highlight the promise of reasoning-driven learning from action-free datasets for advancing generalizable robot control. Project page: https://rad-generalization.github.io
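One way to picture training on both kinds of data is a loss that always supervises reasoning tokens and supervises action tokens only when action labels exist. The sketch below is a hypothetical illustration of that idea, not RAD's actual objective or architecture, which the abstract does not specify. It assumes the model outputs a probability distribution per token, and the names `sequence_loss` and `mixed_loss` are made up for this example.

```python
import math
from typing import Optional, Sequence


def sequence_loss(pred_probs: Sequence[Sequence[float]],
                  targets: Sequence[int]) -> float:
    """Mean cross-entropy over a token sequence.

    pred_probs[i] is the predicted distribution for token i;
    targets[i] is the index of the correct token.
    """
    total = sum(-math.log(probs[t]) for probs, t in zip(pred_probs, targets))
    return total / len(targets)


def mixed_loss(reasoning_probs: Sequence[Sequence[float]],
               reasoning_targets: Sequence[int],
               action_probs: Optional[Sequence[Sequence[float]]] = None,
               action_targets: Optional[Sequence[int]] = None) -> float:
    """Loss for one sample from a mixed dataset (assumed formulation).

    Robot demonstrations carry both reasoning and action labels.
    Action-free human video samples pass action_targets=None, so only
    the reasoning term contributes.
    """
    loss = sequence_loss(reasoning_probs, reasoning_targets)
    if action_probs is not None and action_targets is not None:
        loss += sequence_loss(action_probs, action_targets)
    return loss
```

Under this assumed setup, a human-video sample would be scored with `mixed_loss(reasoning_probs, reasoning_targets)`, while a robot demonstration would also pass its action predictions and labels.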

artificial intelligence, human video data, reasoning, (14 more...)

arXiv:2502.03729

Country:

Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.50)

R3M: A Universal Visual Representation for Robot Manipulation

Nair, Suraj, Rajeswaran, Aravind, Kumar, Vikash, Finn, Chelsea, Gupta, Abhinav

arXiv.org Artificial Intelligence, Nov-18-2022

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation on the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
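Two of the three ingredients named in the abstract, time-contrastive learning and the L1 penalty, can be sketched on plain embedding vectors. This is a minimal illustration under stated assumptions, not the R3M training objective: similarity is taken to be negative squared Euclidean distance, a single negative frame is used, the weight `l1_weight` is an arbitrary placeholder, and the video-language alignment term is omitted entirely.

```python
import math
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def time_contrastive_loss(anchor: Sequence[float],
                          positive: Sequence[float],
                          negative: Sequence[float]) -> float:
    """InfoNCE-style loss with one negative (assumed form).

    `positive` is the embedding of a frame close in time to `anchor`;
    `negative` is a frame farther away in time or from another video.
    """
    s_pos = -squared_distance(anchor, positive)
    s_neg = -squared_distance(anchor, negative)
    # Numerically stable log(exp(s_pos) + exp(s_neg)).
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos


def l1_penalty(embedding: Sequence[float]) -> float:
    """Sum of absolute values; penalizing it pushes embeddings toward sparsity."""
    return sum(abs(x) for x in embedding)


def combined_loss(anchor: Sequence[float],
                  positive: Sequence[float],
                  negative: Sequence[float],
                  l1_weight: float = 1e-5) -> float:
    """Time-contrastive term plus a weighted L1 term on all three embeddings."""
    sparsity = sum(l1_penalty(e) for e in (anchor, positive, negative))
    return time_contrastive_loss(anchor, positive, negative) + l1_weight * sparsity
```

The contrastive term is small when the temporally close frame is embedded nearer to the anchor than the distant frame, and the L1 term grows with the total magnitude of the embeddings.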

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv:2203.12601

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Oregon (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots > Manipulation (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
